A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources

نویسندگان

  • Anna Feldman
  • Jirka Hana
  • Chris Brew
چکیده

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (the noun homonymy is challenging for morphological analysis and the order variation of adjectives within a sentence makes it challenging to create a realiable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high precision annotated resource.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Building an old Occitan corpus via cross-Language transfer

This paper describes the implementation of a resource-light approach, cross-language transfer, to build and annotate a historical corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic annotation from resource-rich source languages, Old French and Catalan, to a genetically related target language, Old Occitan. The present corpus consists of three sub-corpora in XML format...

متن کامل

Morpho-syntactically Annotated Amharic Treebank

In this paper, we describe an ongoing project of developing a treebank for Amharic. The main objective of developing the treebank is to use it as an input for the development of a parser. Morphologically-rich Languages like Arabic, Amharic and other Semitic languages present challenges to the state-of-art in parsing. In such language morphemes play important functions in both morphology and syn...

متن کامل

Evaluating Cross-Language Annotation Transfer in the MultiSemCor Corpus

In this paper we illustrate and evaluate an approach to the creation of high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. Th...

متن کامل

Portable Language Technology: a Resource-light Approach to Morpho-syntactic Tagging

Morpho-syntactic tagging is the process of assigning part of speech (POS), case, number, gender, and other morphological information to each word in a corpus. Morpho-syntactic tagging is an important step in natural language processing. Corpora that have been morphologically tagged are very useful both for linguistic research, e.g. finding instances or frequencies of particular constructions in...

متن کامل

Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus

In this article we illustrate and evaluate an approach to create high quality linguistically annotated resources based on the exploitation of aligned parallel corpora. This approach is based on the assumption that if a text in one language has been annotated and its translation has not, annotations can be transferred from the source text to the target using word alignment as a bridge. The trans...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006